Final project

Goal: Demonstrate that you know how to do data analysis in R

Minimum requirements:

1 R Markdown file and 1 HTML file
Use a dataset that we have not used in class
“Introduction”, “Data analysis” and “Conclusion” sections
At least 3 data visualizations (not all of the same type)
Examples on the class website
Due Nov 2 (Fri), 23:59:59

Project proposal

1-2 paragraphs long
Details on the problem you wish to explore, datasets you will use, potential visualizations
Due Oct 19 (Fri), 23:59:59

Recap of week 1

Basics of R
Data structures
- Homogeneous: vectors & matrices
- Heterogeneous: lists & data frames
Functions & packages

Vectors

vec <- c("a", "b", "c")
vec

## [1] "a" "b" "c"

vec[c(2,4)]

## [1] "b" NA

Lists

classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)
classes$ID

## [1] "STATS 32"  "STATS 101" "STATS 200"

classes[["credits"]]

## [1] 12

Data frames

A special type of list:

list keys are variable names of the dataset
list values are all vectors of the same length (no. of observations)

data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
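
The "special type of list" claim can be checked directly. A minimal sketch using the built-in mtcars data (the variable name col_lengths is only for illustration):

```r
data(mtcars)

# A data frame is stored as a list whose elements are the columns.
is.list(mtcars)          # TRUE
is.data.frame(mtcars)    # TRUE

# The list keys are the variable names of the dataset.
names(mtcars)

# Every list value is a vector with one entry per observation.
col_lengths <- sapply(mtcars, length)
unique(col_lengths)      # 32, the number of rows

# List-style access works on data frames too.
head(mtcars$mpg, 3)      # 21.0 21.0 22.8
mtcars[["cyl"]][1:3]     # 6 6 4
```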

Getting a feel for your data

str, summary
head, tail
names, dim, nrow, ncol
table
mean, median, sd, var
factor
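
As a quick illustration, the functions above can be tried on the built-in mtcars data. This is only a sketch; the choice of the mpg and cyl columns is arbitrary.

```r
data(mtcars)

str(mtcars)              # structure: column types and first few values
summary(mtcars$mpg)      # min, quartiles, mean and max of one column
head(mtcars, 3)          # first 3 rows
tail(mtcars, 3)          # last 3 rows

names(mtcars)            # column names
dim(mtcars)              # 32 rows, 11 columns
nrow(mtcars)             # 32
ncol(mtcars)             # 11

table(mtcars$cyl)        # counts per distinct value: 11, 7, 14 for 4, 6, 8

mean(mtcars$mpg)         # average
median(mtcars$mpg)       # middle value
sd(mtcars$mpg)           # standard deviation
var(mtcars$mpg)          # variance (the square of the standard deviation)

# factor() turns a column into a categorical variable with explicit levels.
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)       # "4" "6" "8"
```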

Weird thing that happened last time…

I want all the rows such that the value of the cyl column is equal to 2:

vehicles[vehicles$cyl == 2, ]

Small example

df

##    A    B
## 1  1    a
## 2  2    b
## 3  3    c
## 4 NA    d
## 5 NA <NA>

df$A == 2

## [1] FALSE  TRUE FALSE    NA    NA

df[df$A == 2, ]

##       A    B
## 2     2    b
## NA   NA <NA>
## NA.1 NA <NA>

Small example: Fix

Fix 1: test that the value is not NA and is equal to 2

df[!is.na(df$A) & df$A == 2, ]

##   A B
## 2 2 b

Fix 2: use the which function

which(df$A == 2)

## [1] 2

df[which(df$A == 2), ]

##   A B
## 2 2 b
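
The slides print df but do not show how it was built. The sketch below reconstructs a data frame matching the printed output (the construction is an assumption) and replays the problem and both fixes in one place.

```r
# Assumed reconstruction of `df`; the slides only show its printed form.
df <- data.frame(A = c(1, 2, 3, NA, NA),
                 B = c("a", "b", "c", "d", NA))

# Comparing NA with anything gives NA, not FALSE.
keep <- df$A == 2
keep                              # FALSE TRUE FALSE NA NA

# Indexing rows with NA yields a row of NAs, so two extra rows appear.
bad <- df[keep, ]
nrow(bad)                         # 3

# Fix 1: require the value to be non-missing as well.
fix1 <- df[!is.na(df$A) & df$A == 2, ]
nrow(fix1)                        # 1

# Fix 2: which() returns only the positions that are TRUE, dropping NAs.
fix2 <- df[which(df$A == 2), ]
nrow(fix2)                        # 1
```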

Function syntax

Function name
Parentheses, and
A list of arguments within the parentheses
- Options that change what the function does slightly

E.g. Take the mean of c(1,3,NA).

mean(c(1,3,NA))

## [1] NA

mean(c(1,3,NA), na.rm = TRUE)

## [1] 2
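
Arguments can be given by name or by position. A small sketch; seq() is not mentioned in the slides and is used here only as a familiar function with several arguments.

```r
x <- c(1, 3, NA)

# Function name, parentheses, arguments: here `x` is the only argument.
mean(x)                         # NA, because one value is missing

# na.rm is an optional argument; its default is FALSE.
mean(x, na.rm = TRUE)           # 2

# Arguments can be matched by name, in any order...
by_name  <- seq(from = 1, to = 10, by = 3)
shuffled <- seq(by = 3, to = 10, from = 1)

# ...or by position, following the order in the function's definition.
by_position <- seq(1, 10, 3)

by_name                         # 1 4 7 10

# args() shows the arguments a function accepts and their defaults.
args(seq.default)
```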

Agenda for today

Different kinds of plots
Plotting with ggplot2 (and the + syntax)

Words vs. pictures

“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6
## 11 17.8  3.440         6
## 12 16.4  4.070         8
## 13 17.3  3.730         8
## 14 15.2  3.780         8
## 15 10.4  5.250         8
## 16 10.4  5.424         8
## 17 14.7  5.345         8
## 18 32.4  2.200         4
## 19 30.4  1.615         4
## 20 33.9  1.835         4
## 21 21.5  2.465         4
## 22 15.5  3.520         8
## 23 15.2  3.435         8
## 24 13.3  3.840         8
## 25 19.2  3.845         8
## 26 27.3  1.935         4
## 27 26.0  2.140         4
## 28 30.4  1.513         4
## 29 15.8  3.170         8
## 30 19.7  2.770         6
## 31 15.0  3.570         8
## 32 21.4  2.780         4

Two classes of variables in statistics

Continuous variable: Variable takes on values which fall on the real number line (or part of it)
- E.g. height, exam score, attendance count
Categorical variable: Variable takes on values which fall into discrete categories
- E.g. ice-cream flavor, country of origin

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?

Not so good…

Easier to see the trend

Boxplots & violin plots: continuous variable vs. categorical variable

For each value of cylinder, what is the distribution of mpg like?

We can combine multiple plots in one graphic

Summary

1 categorical variable: barplot
1 continuous variable: histogram
Continuous vs. continuous: scatterplot
Continuous vs. time: lineplot
Continuous vs. categorical: boxplots & violin plots
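
For reference, each plot type in this summary corresponds to a geom in ggplot2, the package covered later in this session. This is a sketch under assumptions: df is rebuilt from mtcars to match the table shown earlier, and the economics dataset that ships with ggplot2 stands in for the time series, since the slides do not say which data their line plot used.

```r
library(ggplot2)

# Assumed stand-in for the slides' data: three columns taken from mtcars.
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# 1 categorical variable: barplot
p_bar <- ggplot(df, aes(x = cylinders)) + geom_bar()

# 1 continuous variable: histogram
p_hist <- ggplot(df, aes(x = mpg)) + geom_histogram(bins = 10)

# Continuous vs. continuous: scatterplot
p_scatter <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

# Continuous vs. time: lineplot (economics is a dataset included in ggplot2)
p_line <- ggplot(economics, aes(x = date, y = unemploy)) + geom_line()

# Continuous vs. categorical: boxplot and violin plot
p_box <- ggplot(df, aes(x = cylinders, y = mpg)) + geom_boxplot()
p_violin <- ggplot(df, aes(x = cylinders, y = mpg)) + geom_violin()

# Typing a plot's name at the console, e.g. p_box, draws it.
```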

Case study

I have father-son pairs. For each pair, I record their heights and weights, as well as their ethnicities. I want to study the relationship between the characteristics of the father and those of the son. What plots could help me?

Data visualization in R: 2 broad approaches

base R

`ggplot2`
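
To make the contrast concrete, here is the same scatterplot drawn both ways. This is a sketch on the built-in mtcars data; the titles and axis labels are arbitrary choices.

```r
# base R: one function call, with arguments controlling the appearance.
plot(mtcars$wt, mtcars$mpg,
     xlab = "weight", ylab = "mpg",
     main = "mpg vs. weight (base R)")

# ggplot2: the plot is built up from data, aesthetic mappings and a geom.
library(ggplot2)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +
    labs(x = "weight", y = "mpg", title = "mpg vs. weight (ggplot2)")
p
```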

How can we describe a graphic?

The Grammar of Graphics, by Leland Wilkinson (1999)
Motivating example: language
- Grammar: “a formal system of rules for generating lawful statements in a language”
Can we develop similar rules for generating plots? Yes!
Refined by Hadley Wickham’s “A Layered Grammar of Graphics” (2010)
- operationalized as ggplot2 package
- “official” cheat sheet available
- ggplot2 reference manual

Hadley Wickham

3 essential elements of graphics: data, geometries, aesthetics

Data: Dataset we are using for the plot

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6

3 essential elements of graphics: data, geometries, aesthetics

Geometries: Visual elements used for our data

E.g. point, line, histogram, bar, boxplot

Geom: point

3 essential elements of graphics: data, geometries, aesthetics

Aesthetics: Define which data columns affect various aspects of the geom

E.g. x, y, color, fill, size, alpha, line type, line width
Which aesthetics you use depends on the geometries you choose

3 different aesthetics:

x-axis: weight
y-axis: mpg
color: cylinders
shape, size, etc. take on default values, not determined by data
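
In ggplot2 code, these three mappings go inside aes(). A minimal sketch, assuming df is built from mtcars to match the table shown earlier (the construction is not given in the slides):

```r
library(ggplot2)

# Assumed reconstruction of the slides' data from mtcars.
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Three aesthetics are mapped to data columns inside aes().
p <- ggplot(data = df,
            mapping = aes(x = weight, y = mpg, color = cylinders)) +
    geom_point()
p

# An aesthetic set outside aes() is a constant, not determined by data.
p_fixed <- ggplot(data = df, mapping = aes(x = weight, y = mpg)) +
    geom_point(color = "blue", size = 3)
```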

Examples of other aesthetics

x-axis: weight
y-axis: mpg
size: cylinders
alpha: weight

Examples of other aesthetics

x-axis: weight
y-axis: mpg
color: cylinders
shape: cylinders
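
The two sets of mappings above might be written as follows. This is a sketch: here df keeps cylinders as a number (an assumption), so it can drive size directly and is wrapped in factor() where a categorical column is needed.

```r
library(ggplot2)

# Assumed reconstruction of the slides' data; cylinders kept numeric here.
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = mtcars$cyl)

# size and alpha driven by numeric columns.
p_size_alpha <- ggplot(df, aes(x = weight, y = mpg,
                               size = cylinders, alpha = weight)) +
    geom_point()

# color and shape driven by the same column; shape needs a categorical
# variable, hence factor().
p_color_shape <- ggplot(df, aes(x = weight, y = mpg,
                                color = factor(cylinders),
                                shape = factor(cylinders))) +
    geom_point()
```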

Combining multiple plots into one graphic: Layers

We can have more than one layer in a graphic.


Each layer contains (essentially):

1 dataset, 1 geometric object, aesthetic mappings

`ggplot2` code: take 1

Making use of ggplot’s sensible defaults:

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg))

`ggplot2` code: take 2

Using jitter to avoid “overplotting”:

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg), 
               position = "jitter")

`ggplot2` code: take 3

When layers share attributes, we only have to type them once:

ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")
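
A self-contained version of take 3 that can be pasted into a fresh R session. The construction of df is an assumption; cylinders is stored as a factor so that there is one box per cylinder count.

```r
library(ggplot2)

# Assumed reconstruction of `df` from mtcars.
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Data and mappings given in ggplot() are inherited by every layer.
p <- ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")
p

# A layer can still add to what it inherits, e.g. its own color mapping.
p_override <- ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(mapping = aes(color = weight), position = "jitter")
```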

Today’s dataset: Diamonds

What makes an expensive diamond?

(Source: USA TODAY)

Optional material

Full specification of a graphic

One graphic contains:

1 or more layers
- Each layer has 1 dataset, 1 geometric object, aesthetic mappings, 1 statistic (default usually ok), 1 position (default usually ok)
1 scale for each aesthetic mapping (defaults usually ok)
1 coordinate system (default usually ok)
facet specification (if any)

Other grammatical elements: statistics

Behind the scenes, R may need to do some transformation on the dataset to make the graphic.

Each geometry has a default statistic, usually good enough
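
For example, a barplot needs counts per category, which geom_bar() computes with its default "count" statistic. The sketch below does the same transformation by hand for comparison; the data frame and variable names are illustrative.

```r
library(ggplot2)

df <- data.frame(cylinders = factor(mtcars$cyl))

# Default statistic: geom_bar() counts the rows in each category itself.
p_auto <- ggplot(df, aes(x = cylinders)) + geom_bar()

# The same transformation done by hand...
counts <- as.data.frame(table(cylinders = df$cylinders))
counts    # one row per category, with the count in column Freq

# ...then plotted with stat = "identity", i.e. use the values as given.
p_manual <- ggplot(counts, aes(x = cylinders, y = Freq)) +
    geom_bar(stat = "identity")
```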

Other grammatical elements: position

Sometimes we need to tweak the position of the geometric elements because they obscure each other.

E.g. jitter: randomly shifting points slightly

Only 9 data points??

Much better
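
A sketch of jittering in code. The columns used here (cyl and gear from mtcars) are an assumption chosen because they take few distinct values; they are not necessarily the ones in the figures above.

```r
library(ggplot2)

# Both columns take only a few distinct values, so many points coincide.
p_overplot <- ggplot(mtcars, aes(x = cyl, y = gear)) +
    geom_point()

# Number of distinct (cyl, gear) positions, compared with the 32 rows.
n_positions <- nrow(unique(mtcars[, c("cyl", "gear")]))
n_positions
nrow(mtcars)

# position_jitter() shifts each point by a small random amount;
# the seed makes the random shifts reproducible.
p_jitter <- ggplot(mtcars, aes(x = cyl, y = gear)) +
    geom_point(position = position_jitter(width = 0.2, height = 0.2,
                                          seed = 1))
```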

Other grammatical elements: facets

Plotting different parts of our data on different canvases
We can facet by rows and/or columns
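
A sketch of both kinds of faceting. The data frame is an assumed reconstruction from mtcars, and the transmission column (built from mtcars$am, where 0 is automatic and 1 is manual) is added only for illustration.

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl),
                 transmission = factor(mtcars$am,
                                       labels = c("automatic", "manual")))

base_plot <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point()

# One canvas per value of cylinders.
p_wrap <- base_plot + facet_wrap(~ cylinders)

# Facet by rows and columns at the same time.
p_grid <- base_plot +
    facet_grid(rows = vars(transmission), cols = vars(cylinders))
```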

Other grammatical elements: scales

Aesthetic mappings only tell you which column corresponds to which aesthetic (e.g. cylinder -> color)
They do not tell you which color should represent which cylinder value
Scales define that for you

Examples of scales (Source: A Layered Grammar of Graphics)

Scales example: colors

Default colors

Manually chosen colors

Scales example: x- & y-axes

Default axis limits

Manually chosen axis limits
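
In ggplot2, scales are adjusted by adding scale_* functions to a plot. A sketch with arbitrarily chosen colors and limits, again assuming df is rebuilt from mtcars:

```r
library(ggplot2)

df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Default color scale and default axis limits.
p <- ggplot(df, aes(x = weight, y = mpg, color = cylinders)) +
    geom_point()

# Manually chosen colors: one color per value of cylinders.
p_colors <- p +
    scale_color_manual(values = c("4" = "red", "6" = "darkgreen",
                                  "8" = "blue"))

# Manually chosen axis limits (wide enough to keep every point).
p_limits <- p +
    scale_x_continuous(limits = c(0, 6)) +
    scale_y_continuous(limits = c(0, 40))
```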

Other grammatical elements: themes

Refers to all non-data ink

Titles, axis ticks & labels, background color, legend, etc.
Can manually set each item, or use preset themes

ggplot2’s default theme

Minimal theme

More pre-set themes

Classic theme

Dark theme
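
The preset themes named above are added to a plot like any other component. A sketch; the plot itself and the manual settings at the end are arbitrary examples.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point() +
    labs(title = "mpg vs. weight", x = "weight", y = "mpg",
         color = "cylinders")

# Preset themes.
p_default <- p + theme_gray()      # ggplot2's default theme
p_minimal <- p + theme_minimal()
p_classic <- p + theme_classic()
p_dark    <- p + theme_dark()

# Individual items can also be set manually with theme().
p_custom <- p +
    theme(legend.position = "bottom",
          plot.title = element_text(face = "bold"))
```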

Colors in R

By name: e.g. “blue”, “red”, “black”, “white” (run colors() for the full list)
By RGB value: e.g. rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
By hexadecimal value: e.g. “#0000FF”, “#FF0000”, “#000000”, “#FFFFFF”
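
The three ways of specifying a color are interchangeable. A short base R sketch showing how they relate:

```r
# By name: colors() lists every built-in color name.
head(colors())
"blue" %in% colors()          # TRUE

# By RGB value: rgb() takes components between 0 and 1
# and returns the hexadecimal string.
rgb(0, 0, 1)                  # "#0000FF"
rgb(1, 0, 0)                  # "#FF0000"

# col2rgb() goes the other way, giving components on a 0-255 scale.
col2rgb("blue")

# All three forms give the same color in a plot.
plot(1:3, rep(1, 3), pch = 19, cex = 3,
     col = c("blue", rgb(0, 0, 1), "#0000FF"))
```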

STATS 32 Session 3: Data Visualization
